JP2008204193A

JP2008204193A - Content retrieval/recommendation method, content retrieval/recommendation device, and content retrieval/recommendation program

Info

Publication number: JP2008204193A
Application number: JP2007039945A
Authority: JP
Inventors: Takeshi Irie; 豪入江; Kota Hidaka; 浩太日高; Takashi Sato; 隆佐藤; Yukinobu Taniguchi; 行信谷口; Shinya Nakajima; 信弥中嶌
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-02-20
Filing date: 2007-02-20
Publication date: 2008-09-04
Anticipated expiration: 2027-02-20
Also published as: JP4891802B2

Abstract

PROBLEM TO BE SOLVED: To provide a content retrieval/recommendation device capable of performing retrieval/recommendation of contents according to a user's emotion.

SOLUTION: The device comprises an emotion estimation part F100 which estimates the emotion and degree of emotion of a multimedia content from voice signal data contained in the multimedia content; a content accumulation part F200 which accumulates contents with the emotion and degree of emotion estimated by the estimation part F100 as metadata; a retrieval request receiving part F300 which receives a retrieval request corresponding to the emotion, or to the emotion and the degree of emotion; a similarity calculation part F400 which calculates, based on the retrieval request, a similarity of the contents or partial contents; and a result presentation part F500 which presents, based on the similarity, a retrieval/recommendation result of the contents or partial contents.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a content search/recommendation method, a content search/recommendation device, and a content search/recommendation program that estimate the emotion of content or of a partial content thereof, together with an emotion level indicating the strength of that emotion, and search for and recommend content on that basis. In this invention, content refers to video/audio content, and audio includes not only human speech but also singing voice, music, environmental sound, and the like. Emotion includes psychological states such as feelings, affect, and mood, as well as atmosphere and impression.

Content is now increasingly viewed not only through broadcasting but also on websites and personal PCs. The types of content have also become highly diverse, including movies, dramas, home videos, news, documentaries, and music.

As a result, it becomes difficult for a user to efficiently find content that suits his or her preferences among such diverse content. In particular, confirming whether content suits one's preferences requires actually viewing it to grasp what it contains, and the time cost of doing so is very large. For this reason, content search techniques, and furthermore content recommendation techniques that automatically recommend content in line with the user's preferences, are indispensable. For the same reason, it is desirable that information related to content, such as the metadata referred to when a search is executed, also be assigned automatically rather than manually.

As a conventional technique related to content search, there is the method described in Patent Document 1 below. Patent Document 1 discloses a method in which physical quantities such as the size and color of a subject are converted into words by classifying them into categories such as "large/medium/small" or "red/yellow/green", and a search is executed based on these words.

Note that, in relation to the present invention, methods for extracting the fundamental frequency and power are described in Non-Patent Document 1 below, speech rate is described in Non-Patent Document 2, methods for estimating the parameters of the audio model are described in Non-Patent Documents 3 and 4, natural language processing is described in Non-Patent Documents 5 and 6 and Patent Document 2, extraction of video features is described in Non-Patent Document 7 and Patent Documents 3, 4, and 5, and a method for calculating optical flow is described in Non-Patent Document 8.

Patent Document 1: Japanese Patent Laid-Open No. 5-282380
Patent Document 2: Japanese Patent Laid-Open No. 5-73317
Patent Document 3: Japanese Patent No. 3408117
Patent Document 4: Japanese Patent Laid-Open No. 2005-157911
Patent Document 5: Japanese Patent No. 3098276
Non-Patent Document 1: "Digital Speech Processing, Chapter 4, 4.9 Pitch Extraction", Sadaoki Furui, Tokai University Press, pp. 57-59, September 1985
Non-Patent Document 2: "Personality information included in dynamic scale of speech", Shigeki Sagayama, Fumitada Itakura, Proceedings of the Acoustical Society of Japan 1979 Spring Meeting, 3-2-7, pp. 589-590, 1979
Non-Patent Document 3: "Easy-to-understand pattern recognition", Kenichiro Ishii, Naonori Ueda, Eisaku Maeda, Hiroshi Murase, Ohmsha, pp. 52-54, 1998
Non-Patent Document 4: "Computational Statistics I, Chapter III, 3 EM Method, 4 Variational Bayes Method", Naonori Ueda, Iwanami Shoten, pp. 157-186, June 2003
Non-Patent Document 5: "Japanese Vocabulary System" (Nihongo Goi-Taikei), supervised by NTT Communication Science Laboratories, edited by Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Oyama, Yoshihiko Hayashi, Iwanami Shoten, 1997
Non-Patent Document 6: "Basic Technology of Natural Language Processing", IEICE, Corona Publishing, March 1988
Non-Patent Document 7: "Research on structured video handling mechanism and video interface based on video feature indexing, Chapter 3 Video indexing based on image processing", Yoshinobu Tonomura, doctoral dissertation, Kyoto University, pp. 15-23, 2006
Non-Patent Document 8: "Computer Image Processing", edited by Hideyuki Tamura, Ohmsha, pp. 242-247, December 2002

The conventional method realizes similarity search based on the physical features of video. However, because physical features are not directly tied to the user's preferences, search and recommendation reflecting those preferences could not be performed.

When viewing content, each user's preferences are important, but the user's emotion at the time of viewing is equally important. The content a user wants to watch is not always the same; it changes dynamically with the user's emotion. The conventional method, however, could not perform search and recommendation according to the user's emotion.

The present invention has been made in view of the above points, and its object is to provide a content search/recommendation method, a content search/recommendation device, and a content search/recommendation program capable of searching for and recommending content according to the user's emotion.

The present invention has a function of analyzing content to automatically estimate its emotion and emotion level, and of automatically assigning these to the content as metadata. By executing search and recommendation based on this metadata, it realizes content search and recommendation suited to a search request, input by the user, that reflects an emotion.

That is, the content search/recommendation method according to claim 1 of the present invention comprises: an emotion estimation step in which emotion estimation means estimates the emotion and emotion level of content from audio signal data included in multimedia content; a content storage step in which content storage means stores content having, as metadata, the emotion and the emotion level estimated by the emotion estimation means; a search request reception step in which search request reception means receives a search request corresponding to the emotion, or to the emotion and the emotion level; a similarity calculation step in which similarity calculation means calculates the similarity of the content or partial content based on the search request; and a result presentation step in which result presentation means presents a search/recommendation result of content or partial content based on the similarity.

The content search/recommendation device according to claim 7 comprises: emotion estimation means for estimating the emotion and emotion level of content from audio signal data included in multimedia content; content storage means for storing content having, as metadata, the emotion and the emotion level estimated by the emotion estimation means; search request reception means for receiving a search request corresponding to the emotion, or to the emotion and the emotion level; similarity calculation means for calculating the similarity of the content or partial content based on the search request; and result presentation means for presenting a search/recommendation result of content or partial content based on the similarity.

With the above configuration, the audio signal data included in the content is analyzed and its emotion is extracted, so that metadata on the emotion and emotion level of the content can be generated automatically and assigned to the content. Content can then be searched for and recommended according to the user's emotion, based on the emotion and emotion level of the content.
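The five steps of claims 1 and 7 can be pictured as a small data flow. The following Python sketch is only an illustration under stated assumptions: the names `ContentRecord`, `SearchRequest`, `similarity`, and `search` are hypothetical, the emotion level is assumed to lie in [0, 1], and the similarity rule (exact match on the emotion label, then closeness of emotion level) is a placeholder chosen for the example, not the similarity defined by the invention. The emotion estimation step is represented only by the metadata it would have produced.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ContentRecord:
    """Stored content with estimated metadata (output of the estimation step)."""
    title: str
    emotion: str          # estimated emotion label
    emotion_level: float  # estimated strength, assumed to lie in [0, 1]


@dataclass
class SearchRequest:
    """A request naming an emotion and, optionally, an emotion level."""
    emotion: str
    emotion_level: Optional[float] = None


def similarity(request: SearchRequest, record: ContentRecord) -> float:
    # Hypothetical rule: different emotions never match; otherwise score by
    # how close the stored emotion level is to the requested one.
    if request.emotion != record.emotion:
        return 0.0
    if request.emotion_level is None:
        return 1.0
    return 1.0 - abs(request.emotion_level - record.emotion_level)


def search(request: SearchRequest, store: List[ContentRecord],
           top_k: int = 3) -> List[Tuple[float, ContentRecord]]:
    # Similarity calculation over the stored content, then selection of the
    # best-scoring records for presentation.
    scored = [(similarity(request, r), r) for r in store]
    scored = [(s, r) for s, r in scored if s > 0.0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


store = [
    ContentRecord("A", "joy", 0.9),
    ContentRecord("B", "joy", 0.4),
    ContentRecord("C", "sadness", 0.8),
]
results = search(SearchRequest("joy", 0.5), store)
print([record.title for _, record in results])
```

In this toy store, the request for "joy" at level 0.5 retrieves the two "joy" records, with the one whose level is closer to 0.5 ranked first.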

The content search/recommendation method according to claim 2 is the method according to claim 1, wherein the emotion estimation step includes: a feature extraction step of extracting, for each analysis frame, at least one of the fundamental frequency, the temporal variation characteristic of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristic of the rms value of the amplitude, the power, the temporal variation characteristic of the power, the speech rate, and the temporal variation characteristic of the speech rate as an audio feature; an emotion probability calculation step of calculating an emotion probability, using one or more statistical models constructed in advance from training audio signal data, based on at least one of the appearance probability of the audio feature in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion; an emotion level calculation step of calculating, based on the emotion probability, the emotion level of the partial content including one or more analysis frames; and an emotion determination step of determining, based on the emotion probability, the emotion of the partial content composed of one or more analysis frames.

The content search/recommendation device according to claim 8 is the device according to claim 7, wherein the emotion estimation means includes: feature extraction means for extracting, for each analysis frame, at least one of the fundamental frequency, the temporal variation characteristic of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristic of the rms value of the amplitude, the power, the temporal variation characteristic of the power, the speech rate, and the temporal variation characteristic of the speech rate as an audio feature; emotion probability calculation means for calculating an emotion probability, using one or more statistical models constructed in advance from training audio signal data, based on at least one of the appearance probability of the audio feature in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion; emotion level calculation means for calculating, based on the emotion probability, the emotion level of the partial content including one or more analysis frames; and emotion determination means for determining, based on the emotion probability, the emotion of the partial content composed of one or more analysis frames.

With the above configuration, the audio features that are important for estimating emotion and emotion level are extracted and probabilistic estimation is then performed, so that emotion and emotion level can be estimated stably and with high accuracy regardless of the sound-source characteristics of diverse content.
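Two of the per-frame audio features named in claims 2 and 8, the rms value of the amplitude and the power, together with their temporal variation, can be sketched in a few lines. The following is a minimal illustration under assumptions: the function names, the frame length and hop, the definition of power as the mean of squared samples, and the use of a simple frame-to-frame difference as the "temporal variation characteristic" are all choices made for this example. Fundamental frequency and speech rate are omitted; the document refers to Non-Patent Documents 1 and 2 for those.

```python
import math
from typing import List, Sequence


def split_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Cut the signal into analysis frames of frame_len samples, advancing by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def rms(frame: Sequence[float]) -> float:
    """Root mean square of the amplitude within one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def power(frame: Sequence[float]) -> float:
    """Mean squared amplitude within one frame (assumed definition of power)."""
    return sum(x * x for x in frame) / len(frame)


def delta(values: Sequence[float]) -> List[float]:
    """Frame-to-frame difference, used here as a simple temporal variation measure."""
    return [b - a for a, b in zip(values, values[1:])]


signal = [1, -1, 1, -1, 2, -2, 2, -2]
frames = split_frames(signal, frame_len=4, hop=4)
rms_values = [rms(f) for f in frames]
power_values = [power(f) for f in frames]
print(rms_values, delta(rms_values), power_values)
```

For the toy signal above, the amplitude doubles between the two frames, so the rms value rises from 1.0 to 2.0 and the power from 1.0 to 4.0.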

The content search/recommendation method according to claim 3 comprises: an emotion estimation step in which emotion estimation means estimates the emotion and emotion level of content from audio signal data and video signal data included in multimedia content; a content storage step in which content storage means stores content having, as metadata, the emotion and the emotion level estimated by the emotion estimation means; a search request reception step in which search request reception means receives a search request corresponding to the emotion, or to the emotion and the emotion level; a similarity calculation step in which similarity calculation means calculates the similarity of the content or partial content based on the search request; and a result presentation step in which result presentation means presents a search/recommendation result of content or partial content based on the similarity.

The content search/recommendation device according to claim 9 comprises: emotion estimation means for estimating the emotion and emotion level of content from audio signal data and video signal data included in multimedia content; content storage means for storing content having, as metadata, the emotion and the emotion level estimated by the emotion estimation means; search request reception means for receiving a search request corresponding to the emotion, or to the emotion and the emotion level; similarity calculation means for calculating the similarity of the content or partial content based on the search request; and result presentation means for presenting a search/recommendation result of content or partial content based on the similarity.

With the above configuration, analyzing the video signal data in addition to the audio signal data included in the content further improves the accuracy with which the emotion and emotion level of the content are estimated.

The content search/recommendation method according to claim 4 is the method according to claim 3, wherein the emotion estimation step includes: a feature extraction step of extracting, for each analysis frame, at least one of the fundamental frequency, the temporal variation characteristic of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristic of the rms value of the amplitude, the power, the temporal variation characteristic of the power, the speech rate, and the temporal variation characteristic of the speech rate, and of extracting from the video signal data at least one of the shot length, the color histogram, the temporal variation characteristic of the color histogram, and the motion vector as a video feature; an emotion probability calculation step of calculating an emotion probability, using one or more statistical models constructed in advance from training audio signal data and one or more statistical models constructed in advance from training video signal data, based on at least one of the appearance probability of the audio feature in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion; an emotion level calculation step of calculating, based on the emotion probability, the emotion level of the partial content including one or more analysis frames; and an emotion determination step of determining, based on the emotion probability, the emotion of the partial content composed of one or more analysis frames.

The content search/recommendation device according to claim 10 is the device according to claim 9, wherein the emotion estimation means includes: feature extraction means for extracting, for each analysis frame, at least one of the fundamental frequency, the temporal variation characteristic of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristic of the rms value of the amplitude, the power, the temporal variation characteristic of the power, the speech rate, and the temporal variation characteristic of the speech rate, and for extracting from the video signal data at least one of the shot length, the color histogram, the temporal variation characteristic of the color histogram, and the motion vector as a video feature; emotion probability calculation means for calculating an emotion probability, using one or more statistical models constructed in advance from training audio signal data and one or more statistical models constructed in advance from training video signal data, based on at least one of the appearance probability of the audio feature in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion; emotion level calculation means for calculating, based on the emotion probability, the emotion level of the partial content including one or more analysis frames; and emotion determination means for determining, based on the emotion probability, the emotion of the partial content composed of one or more analysis frames.

With the above configuration, the audio features and video features that are important for estimating emotion and emotion level are extracted and probabilistic estimation is then performed, so that emotion and emotion level can be estimated more stably and accurately regardless of the sound-source characteristics and shooting conditions of diverse content.

The content search/recommendation method according to claim 5 is the method according to any one of claims 1 to 4, wherein the search request reception step refers to the emotion, or the emotion and the emotion level, of content or partial content that the user is viewing and/or has viewed, and receives the search request determined on that basis.

The content search/recommendation device according to claim 11 is the device according to any one of claims 7 to 10, wherein the search request reception means refers to the emotion, or the emotion and the emotion level, of content or partial content that the user is viewing and/or has viewed, and receives the search request determined on that basis.

With the above configuration, content that the user is currently viewing or has viewed in the past serves as a clue, so that content having an emotion and emotion level matching the user's preferences can be searched for and recommended without asking the user to enter a search request.
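One way to picture deriving a search request from viewed content is shown below. This is a hedged sketch, not the procedure of the invention: the function name `request_from_history` and the rule it applies (take the most frequent emotion among the viewed items and average the emotion levels of those items) are assumptions introduced purely for illustration.

```python
from collections import Counter
from typing import List, Tuple


def request_from_history(viewed: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Build an (emotion, emotion_level) request from viewed-content metadata.

    viewed holds one (emotion, emotion_level) pair per viewed content or
    partial content. The aggregation rule is a hypothetical choice.
    """
    if not viewed:
        raise ValueError("viewing history is empty")
    # Most frequent emotion label among the viewed items.
    emotion, _ = Counter(e for e, _ in viewed).most_common(1)[0]
    # Average emotion level over the items carrying that label.
    levels = [level for e, level in viewed if e == emotion]
    return emotion, sum(levels) / len(levels)


history = [("joy", 0.5), ("sadness", 0.25), ("joy", 0.75)]
print(request_from_history(history))
```

The resulting pair can be passed to the similarity calculation in place of a request typed by the user.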

The content search/recommendation method according to claim 6 is the method according to any one of claims 1 to 5, wherein the result presentation step ranks content or partial content based on the similarity and, based on the ranking result, presents as a list at least one of the attribute information of the content or partial content, the emotion, the emotion level, a thumbnail, and summary content.

The content search/recommendation device according to claim 12 is the device according to any one of claims 7 to 11, wherein the result presentation means ranks content or partial content based on the similarity and, based on the ranking result, presents as a list at least one of the attribute information of the content or partial content, the emotion, the emotion level, a thumbnail, and summary content.

With the above configuration, in addition to conventional search/recommendation presentation such as listing results in order of similarity, items such as the content title, content attribute information, emotion, emotion level, thumbnail, and partial content presented as summary content are displayed together. This helps the user understand what the content contains and makes it possible to search for and recommend content that better matches the user's request.
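The ranking and list presentation can be sketched as follows. The function name `present_results`, the dictionary keys, and the text layout of each row are hypothetical; thumbnails and summary content are left out because they depend on the media itself.

```python
from typing import Dict, List, Tuple


def present_results(scored: List[Tuple[float, Dict]], top_k: int = 10) -> List[str]:
    """Rank (similarity, item) pairs and format one list row per item."""
    # Highest similarity first, truncated to the requested list length.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    rows = []
    for rank, (score, item) in enumerate(ranked, start=1):
        rows.append(
            f"{rank}. {item['title']} | emotion={item['emotion']} "
            f"| level={item['emotion_level']:.2f} | similarity={score:.2f}"
        )
    return rows


scored_items = [
    (0.60, {"title": "A", "emotion": "joy", "emotion_level": 0.9}),
    (0.90, {"title": "B", "emotion": "joy", "emotion_level": 0.4}),
]
for row in present_results(scored_items):
    print(row)
```

Showing the emotion and emotion level beside each title lets the user judge a result before opening it.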

The content search/recommendation program according to claim 13 is a program for causing a computer to execute each step of the content search/recommendation method according to any one of claims 1 to 6.

With the above configuration, the method according to the present invention can be executed by a computer.

(1) According to the inventions of claims 1 and 7, the audio signal data included in the content is analyzed and its emotion is extracted, so that metadata on the emotion and emotion level of the content can be generated automatically and assigned to the content, and content can be searched for and recommended according to the user's emotion based on the emotion and emotion level of the content.
(2) According to the inventions of claims 2 and 8, the audio features that are important for estimating emotion and emotion level are extracted and probabilistic estimation is then performed, so that emotion and emotion level can be estimated stably and with high accuracy regardless of the sound-source characteristics of diverse content.
(3) According to the inventions of claims 3 and 9, analyzing the video signal data in addition to the audio signal data included in the content further improves the accuracy with which the emotion and emotion level of the content are estimated.
(4) According to the inventions of claims 4 and 10, the audio features and video features that are important for estimating emotion and emotion level are extracted and probabilistic estimation is then performed, so that emotion and emotion level can be estimated more stably and accurately regardless of the sound-source characteristics and shooting conditions of diverse content.
(5) According to the inventions of claims 5 and 11, content that the user is currently viewing or has viewed in the past serves as a clue, so that content having an emotion and emotion level matching the user's preferences can be searched for and recommended without asking the user to enter a search request.
(6) According to the inventions of claims 6 and 12, in addition to conventional search/recommendation presentation such as listing results in order of similarity, items such as the content title, content attribute information, emotion, emotion level, thumbnail, and partial content presented as summary content are displayed together, which helps the user understand what the content contains and makes it possible to search for and recommend content that better matches the user's request.
(7) According to the invention of claim 13, the content search/recommendation method according to the present invention can be executed by a computer.

Embodiments of the present invention are described below with reference to the drawings; the present invention is not limited to the following embodiments.
[First Example of Embodiment: Content Search/Recommendation Using Only Audio Signal Data]
The first example of the embodiment of the present invention is the case where, of the information included in the content, only the audio signal data is used to estimate the emotion and the emotion level. This embodiment is described with reference to FIGS. 1 to 12.

A content search/recommendation method and a content search/recommendation device according to the embodiment of the present invention are described. FIG. 1 is a flowchart illustrating the processing flow of the content search/recommendation method according to the first example of the embodiment, and FIG. 2 is a block diagram illustrating the content search/recommendation device according to the first example of the embodiment.

In the content search/recommendation device 100 of this embodiment, a search request input from the user terminal 200 is transmitted to the information control unit 300 by predetermined communication means. The information control unit 300 searches the content or partial content stored in the database 400 for content or partial content similar to the search request, and presents the search/recommendation result to the user terminal 200 by predetermined communication means.

FIG. 3 is a block diagram illustrating the configuration of the user terminal 200. The user terminal 200 includes, for example: an input unit 210 composed of a keyboard 211 and a pointing device 212 typified by a mouse; a control unit 220 composed of a CPU (Central Processing Unit) 221, a ROM (Read Only Memory) 222, and a RAM (Random Access Memory) 223; a storage unit 230 composed of an HDD (Hard Disc Drive) 231; and a display unit 240 that has a monitor screen 241, such as a liquid crystal screen, and displays information output from the control unit 220 in response to operation of the input unit 210.

The information control unit 300 consists of a CPU 301, a ROM 302, a RAM 303, an HDD 304, and the like, connected to one another. All of the processing in the present invention is performed by this information control unit 300. The programs and data that implement the processing are all stored in storage devices such as the ROM 302 and the HDD 304, read into the RAM 303 as needed, and executed by the CPU 301.

The processing flow is described below for each functional unit provided in the information control unit 300 and the database 400.

FIG. 4 is a block diagram illustrating the configuration of the emotion estimation unit F100, which is the emotion estimation means of the present invention. The emotion estimation unit F100 consists of four units. The audio feature extraction unit F101 (audio feature extraction means) extracts, for each analysis frame of the audio signal data contained in the content, at least one of the following as audio features: the fundamental frequency, the time variation characteristic of the fundamental frequency, the rms value of the amplitude (root mean square: the square root of the mean square; the area average of an oscillating sine wave), the time variation characteristic of the rms value of the amplitude, the power, the time variation characteristic of the power, the audio speed, and the time variation characteristic of the audio speed. The emotion probability calculation unit F102 (emotion probability calculation means) calculates an emotion probability, defined as the probability that the audio features occur, using audio models, which are one or more statistical models built in advance from training audio signal data. The emotion level calculation unit F103 (emotion level calculation means) calculates, from the emotion probability, the emotion level of a partial content containing one or more analysis frames. The emotion determination unit F104 (emotion determination means) determines, from the emotion probability, the emotion of a partial content made up of one or more analysis frames. In addition, the emotion level calculation unit F103 and the emotion determination unit F104 estimate the emotion level and the emotion of the content as a whole, respectively, based on the emotion probability, on the emotion levels of the partial contents, or on the emotions and emotion levels of the partial contents.

Step S100, executed by the emotion estimation unit F100, is performed in advance, before content or partial content is actually searched for or recommended according to the present invention; it estimates the emotion and emotion level of content and partial content. FIG. 5 is a flowchart illustrating the processing flow of step S100. Step S110 builds the statistical models used to calculate the emotion probability needed to obtain the emotion and emotion level. Steps S120 and S130 calculate the emotion probability of content and partial content. Step S140 estimates the emotion and emotion level of content and partial content.

First, in step S110, audio models for calculating the emotion probability are acquired in advance from training audio signal data, for example by the procedure described later.

In step S120, the audio feature extraction unit F101 calculates and extracts the desired audio features from the audio signal data of the imported content for each analysis frame (hereinafter, frame). The audio features consist of one or more of the following: the fundamental frequency, the time variation characteristic of the fundamental frequency, the rms value of the amplitude, the time variation characteristic of the rms value of the amplitude, the power, the time variation characteristic of the power, the audio speed, and the time variation characteristic of the audio speed.

In step S130, the emotion probability calculation unit F102 obtains the emotion probability for each frame: using the audio models acquired in advance in step S110, it calculates, from the audio features computed in step S120, the probability that those audio features occur under each emotion of the content or partial content.

In step S140, the emotion determination unit F104 and the emotion level calculation unit F103 estimate the emotion and the emotion level, respectively, of the content and partial content from the per-frame emotion probabilities calculated in step S130.

Each step is described in detail below.

First, in step S120, the desired audio features are extracted for each frame from the audio signal data of the imported content.

An example of the audio feature extraction method is described below.

The audio features are described first. For estimating the emotion of content and partial content, the preferred audio features are those that, compared with phonological information (which requires analysis of high-dimensional speech parameters), can be obtained stably even from audio in which various sound sources are mixed, and that depend little on content attributes such as genre.

For example, methods that convert speech to text by speech recognition require such phonological information. They are effective for genres in which the speaker's voice can be heard clearly, such as news video. In movies, dramas, home videos, and the like, however, sound sources other than speech, such as background music and environmental sounds, are also present, so the speech cannot be heard clearly and speech recognition is difficult. Moreover, the emotion of content is not necessarily determined by speech alone. For the purpose of estimating emotion in the broad sense that includes impression and atmosphere, audio features are needed that can also treat music, sound effects, environmental sounds, and the like as important factors.

To address this problem, the first example of the embodiment extracts prosodic information, in particular the fundamental frequency, the time variation characteristic of the fundamental frequency, the rms value of the amplitude (hereinafter simply rms), the time variation characteristic of the rms, the power, the time variation characteristic of the power, the audio speed, the time variation characteristic of the audio speed, and so on. In particular, using several kinds of short-time change amounts as the time variation characteristics makes it possible to detect behavior that is important in emotional audio when extracting the emotion contained in content.

Examples of time variation characteristics include inter-frame differences and regression coefficients. For the power, a power spectral density or the like may be used. Various methods for extracting the fundamental frequency and the power exist and are well known; for details, see for example the method described in Non-Patent Document 1.
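As a minimal sketch of the two time variation characteristics named above, the following Python computes an inter-frame difference and a regression coefficient over a per-frame feature track such as the fundamental frequency or the power. The function names, the half-window of two frames, and the edge handling (repeating the first and last values) are assumptions made for illustration; the source does not specify them.

```python
def frame_difference(track):
    """Inter-frame difference: value at frame t minus value at frame t-1.

    The first frame has no predecessor, so its difference is set to 0.0
    (an assumption made here for illustration).
    """
    return [0.0] + [track[t] - track[t - 1] for t in range(1, len(track))]


def regression_coefficient(track, half_window=2):
    """Slope of a least-squares line fitted around each frame.

    For frame t the slope is sum(w * x[t+w]) / sum(w * w) over
    w = -half_window .. half_window. Indices outside the track are
    clamped to the nearest valid frame (an assumption).
    """
    n = len(track)
    offsets = range(-half_window, half_window + 1)
    denom = sum(w * w for w in offsets)
    slopes = []
    for t in range(n):
        num = 0.0
        for w in offsets:
            idx = min(max(t + w, 0), n - 1)
            num += w * track[idx]
        slopes.append(num / denom)
    return slopes
```

For a track that rises by one unit per frame, the regression coefficient away from the edges is 1.0, which matches the intuitive notion of a per-frame slope.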

The audio speed, which covers speech rate, musical rhythm, tempo, and the like, can be extracted as a dynamic measure, for example by the method disclosed in Non-Patent Document 2. For example, the audio speed may be detected by detecting the peaks of the dynamic measure and counting them. Alternatively, the time variation characteristic of the audio speed may be detected by calculating the mean and variance of the peak intervals, which correspond to that characteristic. In the first example of the embodiment, the mean peak interval of the dynamic measure is used as the audio speed.
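The peak-based measurement described above can be sketched as follows, assuming the dynamic measure has already been computed as one value per frame. How the dynamic measure itself is obtained is left to Non-Patent Document 2 and is not reproduced here; the simple local-maximum rule and the function names are hypothetical.

```python
def peak_indices(track):
    """Indices of simple local maxima (higher than the previous frame,
    not lower than the next one). A real detector would likely also
    apply a threshold; that is omitted here."""
    return [
        t for t in range(1, len(track) - 1)
        if track[t - 1] < track[t] >= track[t + 1]
    ]


def peak_interval_stats(track, shift_ms=20):
    """Mean and variance of the intervals between successive peaks, in ms.

    shift_ms is the time shift between frames (20 ms in the example
    framing). Returns None when fewer than two peaks are found.
    """
    peaks = peak_indices(track)
    intervals = [(b - a) * shift_ms for a, b in zip(peaks, peaks[1:])]
    if not intervals:
        return None
    mean = sum(intervals) / len(intervals)
    variance = sum((d - mean) ** 2 for d in intervals) / len(intervals)
    return mean, variance
```

A shorter mean peak interval corresponds to faster audio; the variance gives one possible reading of its time variation characteristic.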

An example of how these audio features are extracted for each frame is as follows. Let the length of one frame (hereinafter, frame length) be, for example, 50 ms, and let each subsequent frame be formed by shifting the current frame by, for example, 20 ms. As shown in FIG. 6, the average of each audio feature within each frame is calculated: the average fundamental frequency, the average time variation characteristic of the fundamental frequency, the average rms, the average time variation characteristic of the rms, the average power, the average time variation characteristic of the power, the average of the mean peak interval of the dynamic measure, and so on. Besides these averages, the maximum, minimum, or range of variation of each audio feature within the frame may also be calculated and used.
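The framing described above can be illustrated with a short Python sketch that cuts a signal into 50 ms frames shifted by 20 ms and computes the rms of each frame. The function names are hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption; the source does not say how the end of the signal is handled.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=50, shift_ms=20):
    """Split a mono signal into overlapping analysis frames.

    frame_ms and shift_ms default to the 50 ms / 20 ms example values.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if frame_len <= 0 or shift <= 0:
        raise ValueError("sample_rate too low for the requested framing")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def rms(frame):
    """Root mean square of the amplitude within one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def summarize(values):
    """Mean, maximum, minimum and range of a feature observed in one frame."""
    return {
        "mean": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
    }
```

With a 50 ms frame and a 20 ms shift, consecutive frames overlap by 30 ms, so each sample contributes to two or three frames.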

In audio that is characteristic of the emotional parts of content, the fundamental frequency itself is often difficult to extract and is frequently missing. For this reason, it is preferable to include the time variation characteristic of the fundamental frequency, which readily compensates for such missing values.

Furthermore, to raise the determination accuracy while keeping speaker dependency low, it is preferable to also include the time variation characteristic of the power. Applying this per-frame extraction to the frames across the entire content yields audio features for every frame.

In step S130, the emotion probability of the audio features under each emotion of the content and partial content is calculated from the audio features of each frame extracted in step S120 and the one or more audio models built in advance in step S110.

An example of the processing of step S110, which builds the audio models, is described first. The audio models are acquired by learning from training audio signal data. As with the audio signal data of content, audio features are extracted from the training audio signal data frame by frame. In addition, labels of predefined emotion categories are assigned manually, for example "fun", "sad", "scary", "intense", "cool", "cute", "exciting", "passionate", "romantic", "violent", "calm", "soothing", "warm", "cold", and "eerie".

Here, the emotion categories are denoted in order as e^1, e^2, ..., and the number of emotion categories is denoted #(K). By associating an audio model with each of these emotion categories, an audio model for calculating the emotion probability is obtained for each emotion category. As the audio model, for example, a normal distribution, a Gaussian mixture distribution, a hidden Markov model, or a generalized state space model is used. A time series model that can capture the temporal transition of emotion, such as a hidden Markov model or a generalized state space model, is preferred.

Known methods such as maximum likelihood estimation, the EM algorithm, and variational Bayes can be used to estimate the parameters of these audio models. For details, see Non-Patent Document 3, Non-Patent Document 4, and the like.

In step S130, the emotion probability of each frame is calculated by inputting the per-frame audio features of the content under analysis, extracted in step S120, into the audio models acquired in step S110. Because the audio models were built in step S110 so that a probability can be calculated for each emotion category, inputting the audio features into each audio model yields the emotion probability, that is, the probability of each emotion category.
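Steps S110 and S130 can be illustrated with the simplest of the listed models, a single normal distribution per emotion category fitted by maximum likelihood. The sketch below is one possible reading, not the method of the source: the diagonal covariance, the variance floor, and the final normalization that makes the per-category values sum to 1 are all assumptions, and the function names are hypothetical.

```python
import math


def fit_gaussian(vectors):
    """Maximum likelihood fit of a diagonal Gaussian to labelled frames.

    vectors: list of feature vectors (lists of floats) for one emotion
    category. Variances are floored at 1e-6 to avoid division by zero.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
        for d in range(dim)
    ]
    return mean, var


def log_likelihood(x, model):
    """Log probability density of feature vector x under one model."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def emotion_probabilities(x, models):
    """Per-category emotion probability for one frame.

    models: dict mapping category name to a fitted model. Likelihoods
    are normalized across categories (an assumption), subtracting the
    largest log value first for numerical stability.
    """
    logs = {k: log_likelihood(x, m) for k, m in models.items()}
    top = max(logs.values())
    weights = {k: math.exp(v - top) for k, v in logs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

Normalizing keeps the values between 0 and 1, which is convenient when emotion levels are later compared across partial contents. A time series model such as a hidden Markov model, which the text prefers, would replace `log_likelihood` with a sequence-level computation.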

In step S140, the emotion and emotion level of the content and partial content are estimated from the emotion probabilities calculated in step S130.

An example of the processing of step S140 is described below. FIG. 7 is a flowchart illustrating the processing flow of step S140.

First, in step S141, partial contents, each made up of one or more frames, are generated.

In the first example of the embodiment, a set of audio sections regarded as continuous audio is first merged into a single section. Such an audio section made up of continuous audio is hereinafter regarded as a partial content. One way to generate audio sections is to exploit the periodicity of continuous audio in the waveform and place a section boundary wherever the change in the autocorrelation function exceeds a certain value.

Next, in step S142, the emotion level of each partial content thus formed is calculated from the emotion probabilities calculated for each emotion category in step S130.

An example of how this emotion level is calculated follows. Let the set of partial contents S in the content be {S_1, S_2, ..., S_NS} in order of time, where NS is the total number of partial contents. Let the frames contained in a partial content S_i be {f_1, f_2, ..., f_NFi}, where NFi is the number of frames contained in S_i. In step S130, each frame f_t was given the frame-level emotion probability p_{f_t}(e^k) of the k-th emotion category e^k. Using these, the emotion level p_{S_i}(e^k) of partial content S_i for the k-th emotion category e^k is calculated, for example, as the average

$$p_{S_i}(e^k) = \frac{1}{NF_i} \sum_{t=1}^{NF_i} p_{f_t}(e^k)$$

or as the maximum

$$p_{S_i}(e^k) = \max_{t = 1, \ldots, NF_i} p_{f_t}(e^k).$$

Performing this calculation for all partial contents gives the emotion level of every emotion category for every partial content (FIG. 8).

As for the emotion of a partial content, for example, the emotion category whose emotion level is highest is taken as the emotion of partial content S_i.
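The averaging and maximum rules for the emotion level, together with the choice of the highest-scoring category as the emotion of a partial content, can be sketched as follows. The data layout (one dictionary of category probabilities per frame) and the function names are assumptions made for illustration.

```python
def emotion_level(frame_probs, category, mode="mean"):
    """Emotion level of one partial content for one emotion category.

    frame_probs: list with one dict per frame, mapping category name to
    the frame-level emotion probability.
    mode: "mean" averages over the frames, "max" takes the largest value.
    """
    values = [p[category] for p in frame_probs]
    if mode == "mean":
        return sum(values) / len(values)
    if mode == "max":
        return max(values)
    raise ValueError("mode must be 'mean' or 'max'")


def judge_emotion(frame_probs, categories, mode="mean"):
    """Return the category with the highest emotion level, plus all levels."""
    levels = {k: emotion_level(frame_probs, k, mode) for k in categories}
    return max(levels, key=levels.get), levels
```

Because both rules combine probabilities without rescaling, the resulting emotion level stays between 0 and 1 whenever the frame-level probabilities do.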

Alternatively, the emotion of a partial content may be estimated from the transition of emotion from frame to frame.

An example of calculating the emotion level by this approach is described below.

A section within a partial content that consists of a predetermined number L (one or more) of consecutive frames is called a small partial content. The emotions of the frames in this section are denoted ef_1, ef_2, ..., ef_L. The emotion of each frame can be determined from the per-frame emotion probabilities, for example by taking the emotion category with the highest probability.

The emotion transition sequence {ef_1, ef_2, ..., ef_L} of these consecutive frames is modeled by one or more audio models associated with the emotion categories. Examples of such models include N-grams, hidden Markov models, and generalized state space models. These models are acquired by learning from training audio signal data, for which the training audio signal data used in step S110 can be reused. When this approach is adopted, these models are preferably also built within step S110.
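One of the models listed above, the N-gram, can be sketched for N = 2 as follows. Each emotion category gets its own bigram model trained on frame-emotion sequences labelled with that category, and a new sequence is scored under every model. The add-one smoothing, the omission of an initial-state term, the normalization across categories, and all function names are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def train_bigram(sequences, labels, alpha=1.0):
    """Estimate smoothed transition probabilities P(cur | prev).

    sequences: list of frame-emotion sequences for one emotion category.
    labels: all frame-emotion labels that may occur.
    alpha: additive smoothing constant.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    model = {}
    for prev in labels:
        total = sum(counts[prev].values()) + alpha * len(labels)
        model[prev] = {cur: (counts[prev][cur] + alpha) / total for cur in labels}
    return model


def sequence_log_prob(seq, model):
    """Log probability of the transitions in seq (initial state ignored)."""
    return sum(math.log(model[p][c]) for p, c in zip(seq, seq[1:]))


def transition_emotion_probabilities(seq, models):
    """Emotion probability of a small partial content for each category.

    models: dict mapping emotion category to its bigram model. Scores are
    normalized across categories so that they sum to 1.
    """
    logs = {k: sequence_log_prob(seq, m) for k, m in models.items()}
    top = max(logs.values())
    weights = {k: math.exp(v - top) for k, v in logs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

A hidden Markov model or generalized state space model would follow the same pattern, with `sequence_log_prob` replaced by the corresponding sequence likelihood.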

With this approach, an emotion probability can be newly calculated for each small partial content within a partial content. The emotion level can then be calculated, for example, by the following procedure.

Let the small partial contents contained in partial content S_i be {s_1, s_2, ..., s_NSi}, where NSi is the number of small partial contents contained in S_i. Each small partial content s_t is given the emotion probability p_{s_t}(e^k) of the k-th emotion category e^k, calculated per small partial content as described above. Using these, the emotion level p_{S_i}(e^k) of partial content S_i for the k-th emotion category e^k is calculated, for example, as the average

$$p_{S_i}(e^k) = \frac{1}{NS_i} \sum_{t=1}^{NS_i} p_{s_t}(e^k).$$

Various other methods are possible, such as filtering within the partial content before calculating the emotion level. Because emotion levels may be compared across partial contents, however, the emotion level should preferably fall within a fixed range, for example between 0 and 1.

Next, in step S143, the emotion and emotion level of the content are estimated.

As one example, the average or maximum of the per-frame emotion probabilities, or of the emotion levels of all partial contents, may be calculated over the entire content and used as the emotion level of the content as a whole. As another example, the emotion category with the highest such emotion level may be taken as the emotion of the content. More simply, the total duration of the partial contents judged to have each emotion may be calculated, and the emotion category with the longest duration may be judged to be the emotion of the content; in this case the emotion of the content can be determined by referring only to the emotions of the partial contents.
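The content-level rules above can be sketched in a few lines. The first function aggregates the emotion levels of the partial contents; the second implements the simpler duration rule. The tuple layout `(start, end, emotion)` and the function names are assumptions made for illustration.

```python
def content_emotion_level(partial_levels, mode="mean"):
    """Emotion level of the whole content for one emotion category,
    from the emotion levels of its partial contents."""
    if mode == "mean":
        return sum(partial_levels) / len(partial_levels)
    if mode == "max":
        return max(partial_levels)
    raise ValueError("mode must be 'mean' or 'max'")


def content_emotion_by_duration(segments):
    """Emotion whose partial contents last longest in total.

    segments: list of (start_seconds, end_seconds, emotion) tuples, one
    per partial content, where emotion is the judged emotion category.
    """
    totals = {}
    for start, end, emotion in segments:
        totals[emotion] = totals.get(emotion, 0.0) + (end - start)
    return max(totals, key=totals.get)
```

The duration rule needs only the judged emotion of each partial content, not its emotion level, which is why the text calls it the simpler option.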

In addition, by storing the time at which each partial content occurs in the content, such as its start time, end time, or center time, time series information on how the emotion level changes over the course of the content can be obtained, as shown in FIG. 9.

Using this time series information, the emotion category that appears most often around a specific time in the content may be taken as the emotion of the content as a whole. In content such as movies and dramas, for example, important, emotionally memorable scenes often occur near the end of the content; in such cases, the emotion category that appears most often near the end time can be taken as the emotion of the content. Alternatively, the emotion levels near the end time may be consulted, and the emotion category with the highest emotion level taken as the emotion of the content.
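The end-of-content rule can be sketched as a vote among the partial contents that overlap a window before the end time. The window length is not given in the source, so it is a parameter here; the tuple layout and the function name are likewise assumptions.

```python
from collections import Counter


def emotion_near_end(segments, content_end, window):
    """Most frequent emotion among partial contents near the end time.

    segments: list of (start_seconds, end_seconds, emotion) tuples.
    content_end: end time of the content in seconds.
    window: length in seconds of the interval before the end to inspect.
    Returns None when no partial content overlaps the window.
    """
    recent = [
        emotion for start, end, emotion in segments
        if end > content_end - window
    ]
    if not recent:
        return None
    return Counter(recent).most_common(1)[0][0]
```

Replacing the count with the highest emotion level in the window would give the alternative mentioned at the end of the paragraph.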

The content storage unit F200 (content storage means) in FIG. 2 stores content or partial content in the database 400 in association with the emotion and emotion level estimated by the emotion estimation unit F100. The database 400 consists of a predetermined storage device and the content or partial content data. For relatively small-scale use, such as by an individual or within a home, the storage device may be, for example, an HDD in the user terminal, an HDD in an HDD recorder, or an HDD in a predetermined server device connected to the user terminal by a LAN (Local Area Network) or the like. It may also be a removable external storage medium such as a DVD. For large-scale use, it may be a storage device attached to a predetermined server device connected to the user terminal by a wide area network such as the Internet.

In step S200 of FIG. 1, the content storage unit F200 stores in the database 400 the content or partial content together with the emotion and emotion level, which are attribute information of the content calculated in step S100 of FIG. 5.

The search request reception unit F300 (search request reception means) in FIG. 2 receives, via a predetermined communication means, a search request entered at the user terminal 200. Step S300, in which the user enters a search request at the user terminal 200 and the search request reception unit F300 receives it, is described below. FIG. 10 is a diagram illustrating step S300.

In step S300A, the user selects the emotion and emotion level of the desired content and enters them as a search request. The emotion and emotion level may, for example, be typed directly with the keyboard 211 of FIG. 3, or selected from a pull-down menu or the like displayed on the monitor screen 241.

As one example of direct entry, the user enters the emotion word corresponding to a category together with a value representing its strength. For example, entering "emotion: fun, emotion level: 0.9" makes "emotion: fun, emotion level: 0.9" the search request.

When the emotion word entered does not exactly match any emotion category, it can be regarded as corresponding to a synonymous emotion category and converted accordingly. For example, if the user enters "laughable" and no "laughable" emotion category has been prepared, but emotion categories that can be inferred to be synonymous, such as "fun" or "funny", have been, the input can be processed as belonging to those emotion categories.

An emotion word that is hard to assign to a single emotion category may be associated with several emotion categories. For example, if the user enters "emotion: eerie, emotion level: 0.6" and no "eerie" emotion category has been prepared, the request can be processed as, for example, "emotion: scary, emotion level: 0.3 AND emotion: mysterious, emotion level: 0.3".

The user's input may also be a natural-language sentence. In this case, emotion words are extracted from the sentence by a method such as morphological analysis and converted into the corresponding emotion and emotion level. For example, if the user enters "super funny", it can be converted as "super" → emotion level: 0.9 and "funny" → emotion: fun.
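The conversions in the last three paragraphs (synonym mapping, splitting one word across several categories, and reading an intensifier as an emotion level) can be sketched with two lookup tables. The tables, the default level of 0.5, and the function name are hypothetical, and the input is assumed to be already tokenized; morphological analysis is outside the scope of this sketch.

```python
# Hypothetical conversion tables. Each emotion word maps to one or more
# (emotion category, share) pairs; shares for one word sum to 1.
WORD_TO_CATEGORIES = {
    "fun": [("fun", 1.0)],
    "funny": [("fun", 1.0)],
    "laughable": [("fun", 1.0)],
    "eerie": [("scary", 0.5), ("mysterious", 0.5)],
}

# Hypothetical mapping from intensifiers to an emotion level.
INTENSIFIER_LEVELS = {"super": 0.9}

DEFAULT_LEVEL = 0.5  # used when no level is given (an assumption)


def to_search_request(tokens, level=None):
    """Convert tokenized user input into {emotion category: emotion level}.

    tokens: list of words, assumed already segmented.
    level: emotion level entered explicitly by the user, if any. An
    intensifier found among the tokens overrides it.
    """
    for token in tokens:
        if token in INTENSIFIER_LEVELS:
            level = INTENSIFIER_LEVELS[token]
    if level is None:
        level = DEFAULT_LEVEL
    request = {}
    for token in tokens:
        for category, share in WORD_TO_CATEGORIES.get(token, []):
            request[category] = request.get(category, 0.0) + level * share
    return request
```

With these tables, `to_search_request(["eerie"], level=0.6)` splits the level evenly between "scary" and "mysterious", and `to_search_request(["super", "funny"])` yields "fun" at level 0.9, mirroring the two examples in the text.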

Such conversion can be performed using a systematic dictionary such as the one in Non-Patent Document 5, the method shown in Non-Patent Document 6, or the like. For expressions that can be interpreted as relative, such as "super" → emotion level: 0.9, the method of Patent Document 2 can be used. The conversion rules may be designed in advance by the designer, or built by learning so as to reflect users' actual subjective perception.

Affective expressions such as emotion and emotion level can be hard for a user to express precisely as text. In such cases, the search request can be entered through an intuitive graphical interface. For example, a graph whose axes are the emotion levels of the individual emotions, as shown in FIG. 11, may be displayed on the monitor screen 241, and the user may select a preferred value with the pointing device 212 as the search request. Alternatively, an equalizer-style interface like those found on music players, as shown in FIG. 12, may be displayed on the monitor screen 241, and the user may adjust and enter the emotion level of each emotion by operating the pointing device 212.

With these methods, the emotion levels of all emotion categories need not be entered; emotion levels may be selected only for the emotion categories the user wants to request. More simply, the user may select only the desired emotion.

In step S300B, the emotion and emotion level of the content the user is viewing, or was viewing, are consulted and the search request is determined from them. For example, the emotion and emotion level of the content the user is viewing may be used directly as the search request.

Step S300B has the advantage that, even when the user does not enter a search request directly, a similarity search can be performed by using the emotion and emotion level of the content the user is viewing as the search request.

Here, a viewing history may be recorded for each user, and the search request may be corrected based on the emotions and emotion levels of content viewed in the past. With this processing, the history of content being viewed or previously viewed serves as a cue for searching for and recommending content with similar emotions and emotion levels, so that search/recommendation results that adaptively reflect each user's preferences can be presented.

For example, content viewed in the past may be stored together with its viewing time, and the search request may be determined from the emotion levels of the content viewed over the past several hours or of the last several items of content. For example, a weighted average or a maximum of the emotion levels of the viewed content may be calculated and used as the search request. The weights for the weighted average may be set, for example, so that more recently viewed content receives a larger weight, or by introducing a forgetting model based on findings from psychology, such as the Ebbinghaus forgetting curve, and deriving the weights from the viewing time of each past content.

That is, suppose for example that a user has viewed NK items of content in the past several hours, and that the emotion levels of the k-th emotion category e^k for these items are given as pS^1(e^k), pS^2(e^k), ..., pS^NK(e^k).

In this case, the search request Q(e^k) for the k-th emotion category e^k can be calculated, for example, as the weighted average given by Equation (5), where w_j is the weight for the j-th content and may be determined by one of the weighting methods described above.
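
The weighted average described above can be sketched as follows. The equation for Q(e^k) is not reproduced in this text, so the normalized form Q(e^k) = Σ_j w_j pS^j(e^k) / Σ_j w_j is an assumption, as is the exponential decay used here as a simple stand-in for a forgetting model. The names `forgetting_weight`, `search_request_from_history`, and `half_life_hours` are hypothetical.

```python
import math

def forgetting_weight(hours_ago, half_life_hours=2.0):
    """Assumed forgetting model: the weight halves every half_life_hours."""
    return math.exp(-math.log(2.0) * hours_ago / half_life_hours)

def search_request_from_history(history, half_life_hours=2.0):
    """history: list of (hours_ago, {emotion: level}) for recently viewed contents.

    Returns {emotion: weighted average level}. An emotion missing from a
    content contributes a level of 0 for that content (an assumption).
    """
    totals, weight_sum = {}, 0.0
    for hours_ago, levels in history:
        w = forgetting_weight(hours_ago, half_life_hours)
        weight_sum += w
        for emotion, level in levels.items():
            totals[emotion] = totals.get(emotion, 0.0) + w * level
    if weight_sum == 0.0:
        return {}
    return {emotion: total / weight_sum for emotion, total in totals.items()}
```

With this choice, content viewed just now has weight 1 and content viewed one half-life ago has weight 0.5, so more recent content dominates the request.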

Furthermore, in a similar manner, the number of times each content has been viewed in the past several hours may be recorded, and the search request may be corrected on that basis. In this case, for example, a weighted average of the search request and the emotions and emotion levels of the top several (for example, ten) most frequently viewed recent contents may be calculated and used as the final search request. With this processing, search/recommendation results that reflect the emotions and emotion levels of content that has been viewed frequently in recent hours can be presented.

As another method, the viewing-order pattern of the emotions and emotion levels of content viewed in the past several hours may be analyzed, and the search request corrected based on that information. For example, as shown in FIG. 14, if a user often views, along the time axis, content with a high “fun” emotion level, then content with a high “sad” emotion level, then content with a high “fun” emotion level, the user is regarded as preferring to alternate between “fun” and “sad” content rather than viewing “fun” content continuously, or as being in such a mood, and this pattern is reflected in the search request.

That is, for example, the search request is corrected so that content with a high “sad” emotion level is recommended if the user is currently viewing “fun” content, and “fun” content is recommended if the user is currently viewing “sad” content.
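
One way to detect such a viewing-order pattern is to count which emotion most often follows each emotion in the history. The sketch below is only an illustration under that assumption; the text does not specify an analysis method, and the names `dominant_emotion` and `predict_next_emotion` are hypothetical.

```python
from collections import Counter, defaultdict

def dominant_emotion(levels):
    """Emotion category with the highest emotion level in {emotion: level}."""
    return max(levels, key=levels.get)

def predict_next_emotion(history, current):
    """history: list of {emotion: level} in viewing order.

    Returns the emotion that most often followed `current` in the history,
    or `current` itself when the history holds no evidence.
    """
    sequence = [dominant_emotion(levels) for levels in history]
    transitions = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        transitions[prev][nxt] += 1
    if not transitions[current]:
        return current
    return transitions[current].most_common(1)[0][0]
```

For a history that alternates between “fun” and “sad”, the function returns “sad” while “fun” content is being viewed, and vice versa.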

In addition, when emotions and emotion levels are calculated for each partial content as shown in FIG. 9, so that their change over time is available, a search request input interface such as that shown in FIG. 13 may be presented to the user in correspondence with this. The interface allows the user to freely draw the change in emotion level over time for each emotion, and the drawn curves may be used as the search input.

According to the principle of the present invention, search/recommendation can be realized by executing at least one of steps S300A and S300B, but the search request may also be determined and input by combining both.

In this case, for example, the AND or OR of the search requests obtained in steps S300A and S300B may be taken as the final search request. When the search requests include emotion levels, a weighted average of those emotion levels may be calculated and used as the final search request.
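
The combination of the two search requests can be sketched as follows. The representation of a request as a set of emotion names or as an {emotion: level} dictionary, the handling of categories present on only one side, and the names `combine_emotions`, `combine_levels`, and `manual_weight` are all assumptions made for illustration.

```python
def combine_emotions(manual, derived, mode="AND"):
    """Combine two sets of requested emotion categories.

    AND keeps the categories requested by both; OR keeps those requested by either.
    """
    if mode == "AND":
        return set(manual) & set(derived)
    if mode == "OR":
        return set(manual) | set(derived)
    raise ValueError("mode must be 'AND' or 'OR'")

def combine_levels(manual, derived, manual_weight=0.5):
    """Weighted average of two {emotion: level} requests.

    A category present on only one side keeps that side's level (an assumption).
    """
    combined = {}
    for emotion in set(manual) | set(derived):
        a = manual.get(emotion)
        b = derived.get(emotion)
        if a is None:
            combined[emotion] = b
        elif b is None:
            combined[emotion] = a
        else:
            combined[emotion] = manual_weight * a + (1.0 - manual_weight) * b
    return combined
```

Here `manual` stands for the request entered in step S300A and `derived` for the request obtained from viewed content in step S300B.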

The search request receiving unit F300 (search request receiving means) in FIG. 2 receives the search request input through the above steps.

The similarity calculation unit F400 (similarity calculation means) in FIG. 2 calculates the similarity between the search request received by the search request receiving unit F300 and the emotions and emotion levels of the contents or partial contents stored in the database 400. Step S400 of FIG. 1, which is the processing executed by the similarity calculation unit F400, is described below.

In step S400, the search request input in step S300 is compared with the emotions and emotion levels assigned to the contents in the database stored in advance in step S200, and the similarity is calculated.

Here, let r(e^k) denote the emotion level of the k-th emotion category e^k in the search request, and let p(e^k) denote the emotion level of the k-th emotion category e^k assigned to a content or partial content. The similarity fs is calculated by comparing the emotions and emotion levels input as the search request with those assigned to the content or partial content. For example, letting K be the index set of the emotion categories selected by the user and #(K) the number of its elements, fs can be calculated by Equation (6).

When only emotions are selected in the user's search request and no emotion levels are input, the similarity can instead be calculated by Equation (7). Through the above processing, the similarity to the search request can be calculated for each content or partial content.
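
The similarity equations themselves are not reproduced in this text. Purely as an illustration, the sketch below assumes two plausible forms: for requests with emotion levels, one minus the mean absolute difference between r(e^k) and p(e^k) over the selected categories K, and for emotion-only requests, the mean of p(e^k) over K. These forms, the assumption that emotion levels lie in [0, 1], and the function names are hypothetical and may differ from the patent's equations.

```python
def similarity_with_levels(request, content):
    """request, content: {emotion: level in [0, 1]}.

    Assumed form: fs = 1 - (1 / #(K)) * sum over k in K of |r(e^k) - p(e^k)|,
    where K is the set of categories present in the request.
    """
    selected = list(request)
    if not selected:
        return 0.0
    total_diff = sum(abs(request[e] - content.get(e, 0.0)) for e in selected)
    return 1.0 - total_diff / len(selected)

def similarity_emotions_only(selected, content):
    """Assumed form: fs = (1 / #(K)) * sum over k in K of p(e^k)."""
    selected = list(selected)
    if not selected:
        return 0.0
    return sum(content.get(e, 0.0) for e in selected) / len(selected)
```

Both functions ignore categories that the user did not select, which matches the description that only the selected index set K enters the calculation.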

The result presentation unit F500 (result presentation means) in FIG. 2 generates search results based on the similarity calculated by the similarity calculation unit F400 and transmits them to the user terminal 200. The user terminal 200 receives the results and presents them on the monitor 241. Step S500 of FIG. 1, which is the processing procedure executed by the result presentation unit F500, is described below.

In step S500, search results are generated based on the similarity calculated in step S400 and presented to the user. The results are presented as a list, in descending order of similarity, of at least one of the attribute information, emotions and emotion levels, thumbnails, summaries, and the like of each content.

The attribute information may include the content title, creator, keywords, synopsis, creation date and time, format, the URL or path where the content resides, and attribute information of related content. Such information can be attached when, for example, the content follows an XML (eXtensible Markup Language) based description format such as MPEG-7.

As for the emotions and emotion levels, those calculated in step S100 of the present invention are presented on the monitor 241. They may be presented, for example, as text, as time-series information showing how the emotion levels change over the course of the content as in FIG. 9, or as a graph whose axes are the emotion levels of the individual emotions as in FIG. 11.

As for the summary, a partial content may itself be presented as its summary. For a content, one or more of the partial contents it contains may be extracted in descending order of similarity to the search request and presented as the summary. In either case, the summary can be played back and viewed by an operation such as the user pointing, with the pointing device 212 or the like, at the area on the monitor 241 where the summary is presented.
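
The selection of partial contents for a summary can be sketched as follows. The tuple layout of a partial content, the `score` callback, and the decision to return the chosen segments in playback order are assumptions made for illustration; `summarize` is a hypothetical name.

```python
def summarize(partials, score, max_segments=2):
    """Pick the partial contents most similar to the search request.

    partials: list of (start_s, end_s, {emotion: level}).
    score: function mapping an {emotion: level} dict to a similarity value.
    Returns the chosen (start_s, end_s) pairs sorted back into playback order.
    """
    ranked = sorted(partials, key=lambda seg: score(seg[2]), reverse=True)
    chosen = ranked[:max_segments]
    return sorted((start, end) for start, end, _ in chosen)
```

Any similarity function can be passed as `score`, so the same routine serves requests with or without emotion levels.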

Furthermore, because the emotion level is estimated for each partial content, the content can be edited on playback so as to suppress the parts the user does not wish to view. For example, if the user does not wish to view “scary” parts, then for the partial contents estimated to be “scary”, edits can be applied such as giving advance notice (for example, by a countdown), processing the video with a mosaic or blackout, or lowering the volume.
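
As a rough sketch of this editing step, the function below turns per-segment emotion levels into a list of playback edits. The threshold on the emotion level, the action names, and the three-second countdown are assumptions for illustration and do not come from the source; `build_edit_list` is a hypothetical name.

```python
def build_edit_list(partials, unwanted, threshold=0.5, warn_seconds=3.0):
    """partials: list of (start_s, end_s, {emotion: level}) in playback order.

    For every segment whose level for an unwanted emotion reaches the threshold,
    schedule a countdown before the segment and a mosaic-and-mute edit over it.
    """
    edits = []
    for start, end, levels in partials:
        if any(levels.get(e, 0.0) >= threshold for e in unwanted):
            edits.append({"action": "countdown",
                          "start": max(0.0, start - warn_seconds),
                          "end": start})
            edits.append({"action": "mosaic_and_mute", "start": start, "end": end})
    return edits
```

A player would then apply each edit over its time range during playback.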

In addition, for restricted content such as R-rated or PG-rated content, the partial contents containing the emotions responsible for the restriction can be edited in the same way for viewing, or the content itself can be blocked from playback. By allowing this editing to be enabled or disabled per user, settings can be made, for example, so that content that should not be shown to children is automatically not presented.

As for thumbnails, one method is to extract and present, as still images, frames at predetermined time positions in the summary described above, for example the first frame or the middle frame of the summary video. Alternatively, a predetermined number of frames may be extracted from the content or partial content in descending order of emotion probability, starting from the frame with the highest emotion probability.
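
The second thumbnail method, taking frames in descending order of emotion probability, can be sketched in a few lines. The flat list of per-frame probabilities and the name `thumbnail_frames` are assumptions for illustration.

```python
import heapq

def thumbnail_frames(frame_probs, count=3):
    """frame_probs: emotion probability of each frame, indexed by frame number.

    Returns up to `count` frame indices, highest emotion probability first.
    """
    return heapq.nlargest(count, range(len(frame_probs)), key=frame_probs.__getitem__)
```

The returned indices identify the frames to extract as still images.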

The above describes in detail one example of the content search/recommendation method according to the present invention.
[Second Example of Embodiment: Content Search/Recommendation Using Audio Signal Data and Video Signal Data]
The second example of the embodiment of the present invention is a case where the emotions and emotion levels of content are estimated using video signal data in addition to audio signal data. The processing flow and the specific device configuration of the second example may be the same as in the first example of the embodiment, insofar as they are shown in the flow diagram of FIG. 1 and the block diagram of FIG. 2, respectively.

The difference from the first example of the embodiment is that the emotion estimation unit F100 calculates the emotion probability using not only audio signal data but also video signal data. Step S100, executed by the emotion estimation unit F100 of this second example, is described below. The subsequent processing flow and the specific device configuration may all be the same as in the first example of the embodiment.

The configuration of the emotion estimation unit F100 is described with reference to FIG. 15. The emotion estimation unit F100 comprises: an audio feature extraction unit F101A that extracts, for each analysis frame of the audio signal data contained in the content, at least one of the fundamental frequency, the temporal variation characteristic of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristic of the rms value of the amplitude, the power, the temporal variation characteristic of the power, the speech rate, and the temporal variation characteristic of the speech rate as audio features; a video feature extraction unit F101B that extracts, for each analysis frame of the video signal data, at least one of the shot length, the color histogram, the temporal variation characteristic of the color histogram, and the motion vector as video features; an emotion probability calculation unit F102 that calculates emotion probabilities, as the probabilities that the features appear, using one or more audio models and video models, which are statistical models constructed in advance from training audio signal data and training video signal data, respectively; an emotion level calculation unit F103 that calculates, based on the emotion probabilities, the emotion level of a partial content containing one or more analysis frames; and an emotion determination unit F104 that determines, based on the emotion probabilities, the emotion of a partial content composed of one or more analysis frames. The emotion level calculation unit F103 and the emotion determination unit F104 further estimate the emotion level and the emotion of the content, respectively, based on the emotion probabilities, on the emotion levels of the partial contents, or on the emotions and emotion levels of the partial contents.

Step S100, executed by the emotion estimation unit F100, is performed in advance, before contents or partial contents are actually searched for or recommended according to the present invention.

FIG. 16 is a flow diagram explaining the processing flow of step S100. Step S110 constructs the audio models and video models used to calculate the emotion probabilities needed to obtain emotions and emotion levels, and steps S120 to S140 calculate the emotion probabilities of the content and partial contents. Step S150 estimates the emotions and emotion levels of the content and partial contents.

First, in step S110, one or more audio models and video models for calculating emotion probabilities are acquired in advance from training audio signal data and training video signal data, following a procedure such as the example described later.

In step S120, the audio feature extraction unit F101A calculates and extracts the desired audio features from the audio signal data of the captured content for each analysis frame (hereinafter, audio frame). Likewise, the video feature extraction unit F101B calculates and extracts the desired video features from the video signal data of the captured content for each analysis frame (hereinafter, video frame). The audio features consist of one or more of the fundamental frequency, the temporal variation characteristic of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristic of the rms value of the amplitude, the power, the temporal variation characteristic of the power, the speech rate, and the temporal variation characteristic of the speech rate. The video features consist of one or more of the shot length, the color histogram, the temporal variation characteristic of the color histogram, and the motion vector.

In step S130, based on the audio and video features calculated in step S120, the emotion probability calculation unit F102 obtains, for each audio frame and each video frame, the audio emotion probability and the video emotion probability as the probabilities that the audio features and video features, respectively, appear under each emotion of the content or partial content. The audio models and video models acquired in advance in step S110 are used for this.

In step S140, based on the audio emotion probability calculated for each audio frame and the video emotion probability calculated for each video frame in step S130, an emotion probability is obtained for each frame on a frame basis common to the audio and video frames.

In step S150, based on the per-frame emotion probabilities of the content and partial contents calculated in step S140, the emotion level calculation unit F103 and the emotion determination unit F104 estimate the emotion levels and the emotions, respectively, of the content and partial contents.

First, in step S120 of FIG. 16, the desired audio features and video features are extracted for each frame from the audio signal data and video signal data, respectively, of the captured content. Audio feature extraction is the same as in the first example of the embodiment, so an example of the video feature extraction method is described below.

As video features, the shot length, the color histogram, the temporal variation characteristic of the color histogram, the motion vector, and the like are extracted. An example of a temporal variation characteristic is the inter-frame difference.
The video features are extracted for each video frame. For example, the length of one video frame may be 33 ms, with the next video frame formed by a time shift of, for example, 33 ms from the current video frame.

There are various basic methods for extracting the shot length, motion vector, color histogram, and so on. These are well known, and the methods described in, for example, Non-Patent Document 7 can be used.

Because it is practically impossible to extract a shot length from within a single 33 ms video frame, the shot length may be taken, for example, as the length of the shot that contains the video frame in question. Alternatively, the average, maximum, or minimum shot length over an interval containing one or more shots may be used.
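
Assigning each video frame the length of its enclosing shot can be sketched as follows, assuming the shot boundaries have already been detected by some known method. The representation of shots as a sorted list of start times and the name `shot_length_for_frame` are assumptions for illustration.

```python
import bisect

def shot_length_for_frame(frame_time, shot_starts, video_end):
    """Length (s) of the shot that contains the frame starting at frame_time.

    shot_starts: sorted start times (s) of the shots, the first one being 0.0.
    video_end: end time (s) of the video, which closes the last shot.
    """
    i = bisect.bisect_right(shot_starts, frame_time) - 1
    if i < 0 or frame_time >= video_end:
        raise ValueError("frame_time is outside the video")
    next_start = shot_starts[i + 1] if i + 1 < len(shot_starts) else video_end
    return next_start - shot_starts[i]
```

Every 33 ms frame inside the same shot therefore receives the same shot-length feature value.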

The color histogram is extracted, for example, as follows.

The hue is extracted for each pixel in the video frame. By quantizing the hue into a predetermined number Q of bins, for example 11 or 256, each pixel can be assigned to one of the Q bins. Doing this for all pixels and counting the number of occurrences in each bin yields the hue histogram of the video frame.
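
The hue histogram extraction can be sketched with the Python standard library as follows. The input format (an iterable of 8-bit RGB tuples) and the name `hue_histogram` are assumptions; uniform quantization of the hue into Q bins is also an assumption, since the text does not say how the bins are laid out.

```python
import colorsys

def hue_histogram(pixels, bins=11):
    """pixels: iterable of (r, g, b) tuples with components in 0..255.

    Returns a list of `bins` counts, one per uniformly quantized hue bin.
    """
    counts = [0] * bins
    for r, g, b in pixels:
        # colorsys returns the hue as a fraction of the color circle in [0, 1)
        hue, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        counts[min(int(hue * bins), bins - 1)] += 1
    return counts
```

Passing only the pixels of a region of interest to the same function gives a histogram for that region.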

Alternatively, a histogram may be extracted for a specific region only.

The motion vector can be extracted as a vector consisting of an X component and a Y component by, for example, calculating the optical flow. The method of Non-Patent Document 8, for example, can be used to calculate the optical flow. Alternatively, the norm may be calculated for each video frame, or camera operations such as pan, tilt, and zoom may be detected using, for example, the method disclosed in Patent Document 3 and each quantified separately, for example as the amount of operation per unit time.

In step S130, the audio emotion probability and the video emotion probability under each emotion of the content or partial content are calculated using the audio features of each audio frame and the video features of each video frame extracted in step S120, together with the one or more audio models and video models constructed in advance in step S110.

First, an example of the processing of step S110 for constructing the statistical models is described. The audio models may be acquired by the same method as in the first example of the embodiment. The method of acquiring the video models is described below.

The video models are acquired by learning from training video signal data. As with the video signal data of content, video features are extracted from the training video signal data in units of video frames, and labels of the kinds defined as emotion categories, as described above, are assigned manually. In this second example of the embodiment, the emotion categories into which video signal data is classified are assumed to be the same as the emotion categories estimated by the audio models.

By associating these emotion categories with the respective video models, a video model for calculating the video emotion probability is acquired for each emotion category. For these models, a normal distribution, a Gaussian mixture, a hidden Markov model, a generalized state-space model, or the like may be used, for example. Preferably, a time-series model that can model the temporal transitions of emotion, such as a hidden Markov model or a generalized state-space model, is adopted.

Known methods for estimating the parameters of these video models include, for example, maximum likelihood estimation, the EM algorithm, and the variational Bayes method, any of which can be used. For details, see Non-Patent Document 4, Non-Patent Document 5, and the like.
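
As a minimal sketch of the simplest option above, the code below fits one normal distribution per emotion category by maximum likelihood and evaluates a new feature value under each model. The use of a single scalar feature, the variance floor, and the normalization over categories (which amounts to assuming equal priors) are simplifications introduced here, and all function names are hypothetical. The time-series models preferred in the text are not covered.

```python
import math

def fit_gaussian(values):
    """Maximum likelihood estimate of (mean, variance) for scalar features."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, 1e-6)  # variance floor avoids division by zero

def train_video_models(labeled_features):
    """labeled_features: {emotion: [feature values of frames labeled with it]}."""
    return {emotion: fit_gaussian(values) for emotion, values in labeled_features.items()}

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def video_emotion_probabilities(models, x):
    """Likelihood of x under each category model, normalized over the categories."""
    likelihoods = {e: gaussian_pdf(x, m, v) for e, (m, v) in models.items()}
    total = sum(likelihoods.values())
    if total == 0.0:  # all likelihoods underflowed: fall back to a uniform result
        return {e: 1.0 / len(models) for e in models}
    return {e: l / total for e, l in likelihoods.items()}
```

The training step corresponds to step S110 and the evaluation step to step S130 for a single video frame.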

In step S130, the audio emotion probability and the video emotion probability are calculated by inputting, into the audio models and video models acquired in step S110, the audio features of each audio frame and the video features of each video frame, respectively, extracted in step S120 from the content to be analyzed. Because the audio and video models were constructed in step S110 so that a probability can be calculated for each emotion category, inputting the audio features into each audio model and the video features into each video model yields the audio emotion probability and the video emotion probability.

In addition, for the video emotion probability, a region judged to be a face in the video may be detected by, for example, the method disclosed in Patent Document 4, and the result of recognizing the facial expression by, for example, the method disclosed in Patent Document 5 may be reflected.

One way of reflecting this is, for example, when the recognized facial expression corresponds to a certain emotion category, to increase the video emotion probability of that emotion category and to decrease and renormalize the video emotion probabilities of the other emotion categories so that the axioms of probability are satisfied.
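
The increase-and-renormalize step can be sketched as follows. The size of the increase (a fixed fraction `boost` of the remaining probability mass) is an assumption, since the text does not say how much to increase; the input is assumed to sum to 1, and `reflect_facial_expression` is a hypothetical name.

```python
def reflect_facial_expression(video_probs, recognized_emotion, boost=0.2):
    """video_probs: {emotion: probability}, assumed to sum to 1.

    Moves a fraction `boost` of the remaining mass to the recognized emotion
    and scales the other categories down so the result still sums to 1.
    """
    if recognized_emotion not in video_probs:
        return dict(video_probs)
    p = video_probs[recognized_emotion]
    new_p = p + boost * (1.0 - p)
    rest = 1.0 - p
    scale = (1.0 - new_p) / rest if rest > 0.0 else 0.0
    return {e: (new_p if e == recognized_emotion else v * scale)
            for e, v in video_probs.items()}
```

Because the other categories are scaled by a common factor, their relative order is preserved.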

Next, an example of a method (step S140) for unifying the frames and calculating a single emotion probability from the audio emotion probability calculated by the audio model and the video emotion probability calculated by the video model is described.

For example, as shown in FIG. 17, suppose the audio frame length is 50 ms and the video frame length is 33 ms. Among the video frames overlapping a given audio frame, the video frame with the longest overlap is selected, and its video emotion probability pV is multiplied, for each emotion category, by the audio emotion probability pA of that audio frame. Alternatively, predetermined weights may be introduced and a weighted average of pA and pV calculated. The result is taken as the new emotion probability pf_t.

As another method, the larger of pA and pV may be adopted as pf_t. For video content with no audio signal data, a scaling adjustment such as squaring pV may be applied, and the result used as pf_t.

In these cases, the frames are unified to the audio frames.
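
The frame unification can be sketched as follows. The sketch assumes that audio and video frames are contiguous and non-overlapping and that both streams start at time 0, which the text does not state; the function names and the `mode` and `audio_weight` parameters are hypothetical.

```python
def best_video_frame(audio_index, n_video_frames, audio_len=0.050, video_len=0.033):
    """Index of the video frame that overlaps the given audio frame the longest."""
    a_start = audio_index * audio_len
    a_end = a_start + audio_len

    def overlap(v):
        v_start = v * video_len
        return max(0.0, min(a_end, v_start + video_len) - max(a_start, v_start))

    return max(range(n_video_frames), key=overlap)

def fuse(p_audio, p_video, mode="product", audio_weight=0.5):
    """Combine pA and pV per emotion category into pf_t for one audio frame.

    p_audio, p_video: {emotion: probability} over the same emotion categories.
    """
    fused = {}
    for emotion, pa in p_audio.items():
        pv = p_video[emotion]
        if mode == "product":
            fused[emotion] = pa * pv
        elif mode == "average":
            fused[emotion] = audio_weight * pa + (1.0 - audio_weight) * pv
        elif mode == "max":
            fused[emotion] = max(pa, pv)
        else:
            raise ValueError("mode must be 'product', 'average', or 'max'")
    return fused
```

The result is indexed by audio frame, consistent with the frames being unified to the audio frames.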

Next, in step S150, the emotions and emotion levels of the content and partial contents are estimated based on the emotion probability calculated for each unified frame. This processing may be executed in the same manner as step S140 (FIGS. 5 and 7) of the first example of the embodiment.

In the first example of the embodiment of the present invention, partial contents were generated by grouping each set of audio segments regarded as continuous speech into one segment and treating it as a partial content. This method may also be adopted in the second example of the embodiment. However, since the shot length is extracted as a video feature, each shot may instead be treated as a partial content.

Thereafter, the processing from step S200 of FIG. 1 onward may be executed in the same manner as in the first example of the embodiment of the present invention.

In addition, the present invention is not limited to the examples shown as embodiments; the embodiments may be modified as appropriate within the range of embodiments that can be taken based on the principle of the present invention.

Specific examples of searching for and recommending desired contents or partial contents according to the present invention are described below.
(First Example): Search/Recommendation of Content in User Terminal HDD
This example searches for and recommends content stored in the HDD 231 of the user terminal. FIG. 18 shows an example of the specific device configuration of the present invention in this example.

In this example, the information control unit 300 is built into the user terminal 200, and the CPU 221, ROM 222, RAM 223, and HDD 231 of the user terminal (FIG. 3) may be the same components as the CPU 301, ROM 302, RAM 303, and HDD 304 of the information control unit, respectively.

Accordingly, in the rest of this example, devices in the information control unit are referred to by the reference signs of the corresponding devices in the user terminal.

As pre-processing, the emotion estimation unit F100 estimates the emotion and emotion level of each content stored in the HDD 231 of the user terminal from its audio signal data, and the content storage unit F200 stores the content or partial content in the HDD 231 together with this information. The procedure is as follows.
[Procedure 1] The user opens a search request input screen such as the one shown in FIG. 19 and enters an emotion as a search request using the keyboard 211 and the pointing device 212, for example "fun" or "cool".
[Procedure 2] The search request receiving unit F300 receives the search request, and the similarity calculation unit F400 calculates the similarity between the search request and the emotion of each content or partial content stored in the HDD 231 according to equation (7).
[Procedure 3] The result presentation unit F500 refers to the similarity of each content and generates a list ranked in descending order of similarity. It then presents the attribute information, emotion, emotion level, and summary of each content on the monitor 241 in ranking order.
[Procedure 4] The user selects the content to view using the keyboard 211 and the pointing device 212.
[Procedure 5] The user terminal 200 reads the selected content from the HDD 231 and plays it back on the monitor 241.
[Procedure 6] The search request receiving unit F300 receives, as a search request, the weighted average of the emotion levels of previously played content and the content currently being viewed, calculated according to equation (5). The similarity calculation unit F400 calculates the similarity between this search request and the emotion of each content or partial content stored in the HDD 231 according to equation (6).
[Procedure 7] The result presentation unit F500 refers to the similarity of each content and generates a list ranked in descending order of similarity. It then presents a predetermined number of the top-ranked items, for example three, with their attribute information and thumbnails, on the monitor 241.
After that, [Procedure 4] to [Procedure 7] may be repeated until the user finishes, or the user may enter a new search request.
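The search and recommendation loop of Procedures 1 to 7 can be sketched in Python as follows. Equations (5) to (7) are not reproduced in this part of the description, so the sketch substitutes assumed stand-ins: cosine similarity for the similarity calculation and a per-category weighted mean for the viewing-history query. The names `Content`, `similarity`, `rank`, and `weighted_average`, the sample library, and the weights are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

EmotionLevels = Dict[str, float]  # emotion category -> emotion level


@dataclass
class Content:
    title: str
    emotion_levels: EmotionLevels  # metadata estimated in pre-processing


def similarity(query: EmotionLevels, levels: EmotionLevels) -> float:
    """Cosine similarity; an assumed stand-in for equation (6)."""
    keys = set(query) | set(levels)
    dot = sum(query.get(k, 0.0) * levels.get(k, 0.0) for k in keys)
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_l = math.sqrt(sum(v * v for v in levels.values()))
    if norm_q == 0.0 or norm_l == 0.0:
        return 0.0
    return dot / (norm_q * norm_l)


def rank(query: EmotionLevels, library: List[Content],
         top_n: Optional[int] = None) -> List[Content]:
    """Procedures 2-3 and 6-7: order content by descending similarity."""
    ordered = sorted(library,
                     key=lambda c: similarity(query, c.emotion_levels),
                     reverse=True)
    return ordered if top_n is None else ordered[:top_n]


def weighted_average(history: List[EmotionLevels],
                     weights: List[float]) -> EmotionLevels:
    """Per-category weighted mean; an assumed stand-in for equation (5)."""
    total = sum(weights)
    keys = set().union(*history)
    return {k: sum(w * h.get(k, 0.0) for w, h in zip(weights, history)) / total
            for k in keys}


library = [
    Content("comedy clip", {"fun": 0.9, "cool": 0.1}),
    Content("action trailer", {"fun": 0.2, "cool": 0.8}),
    Content("concert", {"fun": 0.6, "cool": 0.6}),
]

# Procedure 1: the user enters "fun" (treated here as full level on one category).
first_results = rank({"fun": 1.0}, library)

# Procedure 6: build the next query from the viewing history,
# weighting the currently viewed content (listed last) more heavily.
history = [library[1].emotion_levels, library[0].emotion_levels]
next_query = weighted_average(history, weights=[1.0, 2.0])
recommended = rank(next_query, library, top_n=3)
```

In this sketch the recommendation step reuses the same ranking function as the initial search; only the origin of the query differs, which mirrors how Procedures 2 and 6 both hand a search request to the similarity calculation unit F400.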
(Second Example): Search / recommendation of content on the Web
In this example, content stored in the database 400 of the server device 500, which includes the information control unit 300, is searched for and recommended in response to search requests entered from user terminals 200a, 200b, ... connected via a wide area network. In particular, this example describes search and recommendation of Web content over the Internet. FIG. 21 shows an example of a specific device configuration for this example. The user accesses a predetermined site provided by the server device 500 and enters search requests through this site.

As pre-processing, the emotion estimation unit F100 estimates the emotion and emotion level of each content stored in the database 400 from its audio signal data and video signal data, and the content storage unit F200 stores the content or partial content in the database 400 together with this information. The procedure is as follows.
[Procedure 1] The user operates the keyboard 211 and the pointing device 212 to access a predetermined Web site.
[Procedure 2] The server device 500 presents a search request input screen such as the one shown in FIG. 19 on the monitor 241 of the user terminal 200.
[Procedure 3] The user enters a search request by adjusting the emotion level of each emotion with the pointing device 212.
[Procedure 4] The search request receiving unit F300 receives the search request, and the similarity calculation unit F400 calculates the similarity between the search request and the emotion and emotion level of each content or partial content stored in the database 400 according to equation (6).
[Procedure 5] The result presentation unit F500 refers to the similarity of each content, generates a list containing the attribute information, emotion, emotion level, summary, and other information of each content in descending order of similarity, and delivers the list to the user terminal.
[Procedure 6] The user terminal 200 presents the delivered list on the monitor 241. The user selects the content to view using the keyboard 211 and the pointing device 212.
[Procedure 7] The server device 500 reads the selected content from the database 400 and delivers it to the user terminal 200.
[Procedure 8] The user terminal 200 receives the content delivered from the server device and plays it back on the monitor 241.
[Procedure 9] The search request receiving unit F300 receives the emotion and emotion level of the content being played as a search request, and the similarity calculation unit F400 calculates the similarity between this search request and the emotion level of each content or partial content stored in the database 400 according to equation (6).
[Procedure 10] The result presentation unit F500 refers to the similarity of each content, generates a list containing the attribute information, emotion, emotion level, summary, and other information of a predetermined number of the top-ranked contents, for example three, and delivers the list to the user terminal 200.
After that, [Procedure 6] to [Procedure 10] may be repeated until the user finishes, or the user may enter a new search request.
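The list that the result presentation unit F500 delivers in Procedures 5 and 10 can be pictured as a small serialized payload. The Python sketch below assumes that similarity scores have already been calculated and that the list is delivered as JSON; the function name `build_result_list`, the field names, and the sample records are hypothetical and are not specified by this description.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Metadata = Dict[str, Any]


def build_result_list(scored: List[Tuple[Metadata, float]],
                      top_n: Optional[int] = None) -> str:
    """Order (metadata, similarity) pairs by descending similarity and
    serialize the entries that would be delivered to the user terminal."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        ordered = ordered[:top_n]  # Procedure 10: keep only the top items
    entries = []
    for position, (meta, score) in enumerate(ordered, start=1):
        entries.append({
            "rank": position,
            "title": meta["title"],                    # attribute information
            "emotion": meta["emotion"],                # stored metadata
            "emotion_levels": meta["emotion_levels"],  # stored metadata
            "summary": meta["summary"],
            "similarity": round(score, 3),
        })
    return json.dumps(entries, ensure_ascii=False)


# Hypothetical records with similarities computed elsewhere.
scored = [
    ({"title": "Rainy ballads", "emotion": "sad",
      "emotion_levels": {"fun": 0.1, "sad": 0.9}, "summary": "chorus"}, 0.35),
    ({"title": "Live concert", "emotion": "fun",
      "emotion_levels": {"fun": 0.8, "sad": 0.1}, "summary": "highlights"}, 0.91),
    ({"title": "Street comedy", "emotion": "fun",
      "emotion_levels": {"fun": 0.9, "sad": 0.0}, "summary": "best skit"}, 0.88),
    ({"title": "News digest", "emotion": "sad",
      "emotion_levels": {"fun": 0.2, "sad": 0.3}, "summary": "headlines"}, 0.12),
]

full_list = build_result_list(scored)          # Procedure 5: every content
top_three = build_result_list(scored, top_n=3)  # Procedure 10: top three only
```

The user terminal 200 would parse such a payload and render it on the monitor 241; the choice of JSON here is only one possible wire format.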
(Third Example): Search / recommendation of content on the Web combined with text search
As in the second example, content stored in the database 400 of the server device 500, which includes the information control unit 300, is searched for and recommended in response to search requests entered from user terminals 200a, 200b, ... connected via the Internet.

In particular, this example combines the content search / recommendation device according to the present invention with a search device 600 of the kind conventionally used for content search. The search device 600 takes text information such as a content title, creator, or genre as a search request and, from among content to which such information has been assigned in advance as attribute information, retrieves the content whose attribute information matches the request.

FIG. 22 shows an example of a specific device configuration for this example. The user accesses a predetermined site provided by the server device 500, which includes the information control unit 300, and enters search requests through this site.

As pre-processing, the emotion estimation unit F100 estimates the emotion and emotion level of each content stored in the database 400 from its audio signal data and video signal data. The content storage unit F200 stores the content or partial content in the database 400 together with this information and with attribute information, which includes text information assigned to each content in advance, such as title, creator, and genre, and keywords extracted from the text surrounding the content. The procedure is as follows.
[Procedure 1] The user operates the keyboard 211 and the pointing device 212 to access a predetermined Web site.
[Procedure 2] The server device 500 presents a search request input screen such as the one shown in FIG. 19 or FIG. 20 on the monitor 241 of the user terminal 200.
[Procedure 3] The user enters a search request by typing text information, such as the title of the content to view, into the search screen and by adjusting the emotion level of each emotion with the pointing device 212.
[Procedure 4] The search device 600 searches the database 400 for content whose attribute information matches the title or other information entered as text in the search request, and generates a candidate list.
[Procedure 5] The search request receiving unit F300 receives the emotion and emotion level in the search request, and the similarity calculation unit F400 calculates the similarity between them and the emotion and emotion level of each content in the candidate list generated in Procedure 4 according to equation (6).
[Procedure 6] The result presentation unit F500 refers to the similarity of each content, generates a list containing the attribute information, emotion, emotion level, summary, and other information of each content in descending order of similarity, and delivers the list to the user terminal.
[Procedure 7] The user terminal 200 presents the delivered list on the monitor 241. The user selects the content to view using the keyboard 211 and the pointing device 212.
[Procedure 8] The server device 500 reads the selected content from the database 400 and delivers it to the user terminal 200.
[Procedure 9] The user terminal 200 receives the content delivered from the server device and plays it back on the monitor 241.
[Procedure 10] The search request receiving unit F300 receives the emotion and emotion level of the content being played as a search request, and the similarity calculation unit F400 calculates the similarity between this search request and the emotion of each content or partial content stored in the database 400 according to equation (6).
[Procedure 11] The result presentation unit F500 refers to the similarity of each content, generates a list containing the attribute information, emotion, emotion level, summary, and other information of a predetermined number of the top-ranked contents, for example three, and delivers the list to the user terminal 200.
After that, [Procedure 7] to [Procedure 10] may be repeated until the user finishes, or the user may enter a new search request. In this example, the search device 600 first searches on the text information, and the resulting candidate list is then narrowed down by emotion and emotion level. Conversely, a candidate list may first be generated from the emotion and emotion level and then narrowed down by a text search.
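The two orderings described above, text search followed by emotion ranking and emotion ranking followed by text narrowing, can be compared with a short Python sketch. The substring match on title, creator, and genre, the dot-product similarity (an assumed stand-in for equation (6)), the size of the emotion-based candidate list, and all function names and sample records are assumptions made for illustration.

```python
from typing import Any, Dict, List

Content = Dict[str, Any]
Levels = Dict[str, float]


def dot(query: Levels, levels: Levels) -> float:
    """Plain dot product; an assumed stand-in for equation (6)."""
    return sum(v * levels.get(k, 0.0) for k, v in query.items())


def text_search(contents: List[Content], keyword: str) -> List[Content]:
    """Role of search device 600: keep content whose attribute
    information (title, creator, genre) contains the keyword."""
    needle = keyword.lower()
    return [c for c in contents
            if any(needle in str(c[field]).lower()
                   for field in ("title", "creator", "genre"))]


def emotion_rank(contents: List[Content], query: Levels) -> List[Content]:
    """Order content by descending similarity to the emotion query."""
    return sorted(contents, key=lambda c: dot(query, c["levels"]), reverse=True)


def text_then_emotion(contents: List[Content], keyword: str,
                      query: Levels) -> List[Content]:
    """Procedures 4-6: text candidate list first, then emotion ranking."""
    return emotion_rank(text_search(contents, keyword), query)


def emotion_then_text(contents: List[Content], keyword: str,
                      query: Levels, candidates: int) -> List[Content]:
    """Reverse order: emotion-based candidate list, then text narrowing."""
    shortlist = emotion_rank(contents, query)[:candidates]
    return text_search(shortlist, keyword)


contents = [
    {"title": "Live in Tokyo", "creator": "Band A", "genre": "music",
     "levels": {"fun": 0.8, "sad": 0.1}},
    {"title": "Rainy Ballads", "creator": "Singer B", "genre": "music",
     "levels": {"fun": 0.1, "sad": 0.9}},
    {"title": "Street Comedy", "creator": "Troupe C", "genre": "variety",
     "levels": {"fun": 0.9, "sad": 0.0}},
]
query = {"fun": 1.0, "sad": 0.0}

by_text_first = text_then_emotion(contents, "music", query)
by_emotion_first = emotion_then_text(contents, "music", query, candidates=2)
```

With these sample records the text-first order returns both music items ranked by emotion, while the emotion-first order with a candidate list of two returns only the music item that survived the emotion cut, which shows that the two orderings need not produce the same list when the first stage truncates its candidates.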

A program for causing a computer to execute the content search / recommendation method may also be constructed.

A recording medium on which the program is recorded may also be supplied to a system or device, whose CPU (MPU) reads and executes the program stored on the recording medium. In this case, the program read from the recording medium itself realizes the functions of the above embodiments. Examples of recording media for the program include CD-ROM, DVD-ROM, CD-R, CD-RW, MO, and HDD.

FIG. 1 is a flow diagram explaining the processing flow of the method in the embodiment of the present invention.
FIG. 2 is a block diagram explaining the configuration of the device in the embodiment of the present invention.
FIG. 3 is a block diagram explaining the configuration of the user terminal in the embodiment of the present invention.
FIG. 4 is a block diagram explaining the configuration of the emotion estimation unit F100 in the first example of the embodiment of the present invention.
FIG. 5 is a flow diagram of the processing executed by the emotion estimation unit F100 in the first example of the embodiment of the present invention.
FIG. 6 is a diagram explaining the extraction of audio feature amounts in the embodiment of the present invention.
FIG. 7 is a flow diagram explaining the processing flow of step S140 in FIG. 5.
FIG. 8 is a diagram explaining the emotion level of partial content in the embodiment of the present invention.
FIG. 9 is a diagram showing an example of time-series information of the emotion level in the embodiment of the present invention.
FIG. 10 is a flow diagram explaining the processing flow of step S300 in FIG. 1.
FIG. 11 is a diagram showing a graph (radar graph) with the emotion categories as axes in the embodiment of the present invention.
FIG. 12 is a diagram showing an equalizer-type interface for adjusting the emotion level of each emotion category in the embodiment of the present invention.
FIG. 13 is a diagram showing a search request input interface based on time-series information of the emotion level of each emotion in the embodiment of the present invention.
FIG. 14 is a diagram showing an example of a user's viewing history in the embodiment of the present invention.
FIG. 15 is a block diagram explaining the configuration of the emotion estimation unit F100 in the second example of the embodiment of the present invention.
FIG. 16 is a flow diagram of the processing executed by the emotion estimation unit F100 in the second example of the embodiment of the present invention.
FIG. 17 is a diagram explaining the method of calculating the emotion probability from the audio emotion probability and the video emotion probability in the embodiment of the present invention.
FIG. 18 is a block diagram showing an example of a specific device configuration in the first example of the embodiment of the present invention.
FIG. 19 is a diagram showing an example of the search request input screen in the embodiment of the present invention.
FIG. 20 is a diagram showing another example of the search request input screen in the embodiment of the present invention.
FIG. 21 is a block diagram showing an example of a specific device configuration in the second example of the embodiment of the present invention.
FIG. 22 is a block diagram showing an example of a specific device configuration in the third example of the embodiment of the present invention.

Explanation of symbols

F100 ... emotion estimation unit, F200 ... content storage unit, F300 ... search request receiving unit, F400 ... similarity calculation unit, F500 ... result presentation unit, 200, 200a, 200b ... user terminal, 211 ... keyboard, 212 ... pointing device, 221, 301 ... CPU, 222, 302 ... ROM, 223, 303 ... RAM, 231, 304 ... HDD, 241 ... monitor, 300 ... information control unit, 400 ... database, 500 ... server device, 600 ... search device.

Claims

1. A content search / recommendation method comprising:
an emotion estimation step in which emotion estimation means estimates the emotion and the emotion level of content from audio signal data included in multimedia content;
a content accumulation step in which content accumulation means accumulates the content with the emotion and the emotion level estimated by the emotion estimation means as metadata;
a search request receiving step in which search request receiving means receives a search request corresponding to the emotion, or to the emotion and the emotion level;
a similarity calculation step in which similarity calculation means calculates the similarity of content or partial content based on the search request; and
a result presentation step in which result presentation means presents a search / recommendation result of content or partial content based on the similarity.

2. The content search / recommendation method according to claim 1, wherein the emotion estimation step includes:
a feature amount extraction step of extracting, for each analysis frame, at least one of a fundamental frequency, a temporal variation characteristic of the fundamental frequency, an rms value of the amplitude, a temporal variation characteristic of the rms value of the amplitude, a power, a temporal variation characteristic of the power, a speech rate, and a temporal variation characteristic of the speech rate as an audio feature amount;
an emotion probability calculation step of calculating, with one or more statistical models constructed in advance from training audio signal data, an emotion probability based on at least one of the appearance probability of the audio feature amount in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion;
an emotion level calculation step of calculating, based on the emotion probability, the emotion level of partial content comprising one or more of the analysis frames; and
an emotion determination step of determining, based on the emotion probability, the emotion of the partial content comprising one or more of the analysis frames.

3. A content search / recommendation method comprising:
an emotion estimation step in which emotion estimation means estimates the emotion and the emotion level of content from audio signal data and video signal data included in multimedia content;
a content accumulation step in which content accumulation means accumulates the content with the emotion and the emotion level estimated by the emotion estimation means as metadata;
a search request receiving step in which search request receiving means receives a search request corresponding to the emotion, or to the emotion and the emotion level;
a similarity calculation step in which similarity calculation means calculates the similarity of content or partial content based on the search request; and
a result presentation step in which result presentation means presents a search / recommendation result of content or partial content based on the similarity.

4. The content search / recommendation method according to claim 3, wherein the emotion estimation step includes:
a feature amount extraction step of extracting, for each analysis frame, at least one of a fundamental frequency, a temporal variation characteristic of the fundamental frequency, an rms value of the amplitude, a temporal variation characteristic of the rms value of the amplitude, a power, a temporal variation characteristic of the power, a speech rate, and a temporal variation characteristic of the speech rate as an audio feature amount, and extracting from the video signal data at least one of a shot length, a color histogram, a temporal variation characteristic of the color histogram, and a motion vector as a video feature amount;
an emotion probability calculation step of calculating, with one or more statistical models constructed in advance from training audio signal data and one or more statistical models constructed in advance from training video signal data, an emotion probability based on at least one of the appearance probability of the audio feature amount in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion;
an emotion level calculation step of calculating, based on the emotion probability, the emotion level of partial content comprising one or more of the analysis frames; and
an emotion determination step of determining, based on the emotion probability, the emotion of the partial content comprising one or more of the analysis frames.

5. The content search / recommendation method according to claim 1, wherein the search request receiving step receives the search request determined based on the emotion, or the emotion and the emotion level, of content or partial content that the user is viewing and/or has viewed.

6. The content search / recommendation method according to claim 1, wherein the result presentation step ranks content or partial content based on the similarity and, based on the ranking result, presents a list of at least one of the attribute information, the emotion, the emotion level, a thumbnail, and summary content of the content or partial content.

7. A content search / recommendation device comprising:
emotion estimation means for estimating the emotion and the emotion level of content from audio signal data included in multimedia content;
content accumulation means for accumulating the content with the emotion and the emotion level estimated by the emotion estimation means as metadata;
search request receiving means for receiving a search request corresponding to the emotion, or to the emotion and the emotion level;
similarity calculation means for calculating the similarity of content or partial content based on the search request; and
result presentation means for presenting a search / recommendation result of content or partial content based on the similarity.

8. The content search / recommendation device according to claim 7, wherein the emotion estimation means includes:
feature amount extraction means for extracting, for each analysis frame, at least one of a fundamental frequency, a temporal variation characteristic of the fundamental frequency, an rms value of the amplitude, a temporal variation characteristic of the rms value of the amplitude, a power, a temporal variation characteristic of the power, a speech rate, and a temporal variation characteristic of the speech rate as an audio feature amount;
emotion probability calculation means for calculating, with one or more statistical models constructed in advance from training audio signal data, an emotion probability based on at least one of the appearance probability of the audio feature amount in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion;
emotion level calculation means for calculating, based on the emotion probability, the emotion level of partial content comprising one or more of the analysis frames; and
emotion determination means for determining, based on the emotion probability, the emotion of the partial content comprising one or more of the analysis frames.

9. A content search / recommendation device comprising:
emotion estimation means for estimating the emotion and the emotion level of content from audio signal data and video signal data included in multimedia content;
content accumulation means for accumulating the content with the emotion and the emotion level estimated by the emotion estimation means as metadata;
search request receiving means for receiving a search request corresponding to the emotion, or to the emotion and the emotion level;
similarity calculation means for calculating the similarity of content or partial content based on the search request; and
result presentation means for presenting a search / recommendation result of content or partial content based on the similarity.

10. The content search / recommendation device according to claim 9, wherein the emotion estimation means includes:
feature amount extraction means for extracting, for each analysis frame, at least one of a fundamental frequency, a temporal variation characteristic of the fundamental frequency, an rms value of the amplitude, a temporal variation characteristic of the rms value of the amplitude, a power, a temporal variation characteristic of the power, a speech rate, and a temporal variation characteristic of the speech rate as an audio feature amount, and extracting from the video signal data at least one of a shot length, a color histogram, a temporal variation characteristic of the color histogram, and a motion vector as a video feature amount;
emotion probability calculation means for calculating, with one or more statistical models constructed in advance from training audio signal data and one or more statistical models constructed in advance from training video signal data, an emotion probability based on at least one of the appearance probability of the audio feature amount in the emotion and the transition probability in the time direction of one or more states corresponding to the emotion;
emotion level calculation means for calculating, based on the emotion probability, the emotion level of partial content comprising one or more of the analysis frames; and
emotion determination means for determining, based on the emotion probability, the emotion of the partial content comprising one or more of the analysis frames.

11. The content search / recommendation device according to claim 7, wherein the search request receiving means receives the search request determined based on the emotion, or the emotion and the emotion level, of content or partial content that the user is viewing and/or has viewed.

12. The content search / recommendation device according to claim 7, wherein the result presentation means ranks content or partial content based on the similarity and, based on the ranking result, presents a list of at least one of the attribute information, the emotion, the emotion level, a thumbnail, and summary content of the content or partial content.

13. A content search / recommendation program for causing a computer to execute each step of the content search / recommendation method according to claim 1.